Why sharding is bad for business

My current employer uses sharded and replicated Postgres via RDS. Even basic things like deploying schema changes to every shard are an unbelievable pain in the ass, and changing the number of shards on a major database is a high-touch, multi-day operation. After having worked with Spanner in the past, it’s like going back to the stone age, like we’re only a step above babysitting individual machines in a closet. Nobody should do this. — HackerNews user GeneralMayhem

Your business faces costs and risks when technology can’t evolve fast enough to meet changing business needs.

Economic conditions have been turbulent lately. For the near future, at least, the only thing certain is uncertainty. To stay ahead you need to be able to flexibly and responsively adapt your app to changes in your customers and the market. But that is challenging to do on a manually sharded database, which makes sharding bad for business.

The overwhelming majority of transactional applications are written on top of traditional relational databases like PostgreSQL. The fundamental flaw with an RDBMS, unfortunately, is scalability. When you’ve optimized your application code, caching, and query patterns and yet can’t squeeze any more performance out of your database layer, it’s time to scale up or scale out.

Scale up or scale out?

Scaling up (or vertically scaling) requires adding more CPU to handle more concurrent connections; adding RAM so more queries can run in memory instead of paging out to disk; and adding more disk to get more capacity for the data and indexes to speed up queries. Vertical scaling can get very expensive, and creates an increasingly unwieldy single point of failure for your application or service, eventually reaching a limit that requires the adoption of sharding.

Scaling out (or horizontally scaling) a database requires adding more database nodes that each take part of the workload. The benefits are immediate because the split workloads improve performance. And if an individual node goes down for any reason, its load can fail over to the others so that parts (maybe even all) of your database can stay online.

Sharding the database is a common solution in scale-up scenarios. Splitting data can help reduce index growth and reduce write locking on a single instance. Sharding is also, however, the only way to achieve a writeable scale-out database for the vast majority of PostgreSQL users.

Sharding PostgreSQL for horizontal scale

Conceptually, sharding is simple. You effectively split your database into multiple separate databases. And with multiple databases come multiple writable primary nodes. Voila, you’ve just created new headroom in your previously at-capacity database! Unfortunately…

Functionally, sharding is complex. When our application was written with the assumption that there’s one database and everything lives in it. The instant you shard a single relational database into two or more databases, however, every place your application touches data will now require editing or rewriting — along with writing additional new application code.

How to build a payments system that scales to infinity (with examples)

This new application code is required because sharding forces the need to embed data logic in the application itself, so that the app “knows” where to source and send data. Typical new application code necessary to function with a sharded database includes (but is not limited to):

Designing a table sharding (partitioning) model
Developing application logic to route data to the correct shard
Developing application logic to perform joins of data across multiple table shards
Developing application logic to perform atomic transactions across multiple table shards or designing the application to avoid needing cross-shard atomic transactions
Developing application logic to handle failure scenarios where some shards or partitions are not available

As you can see, these additions are all seriously manual in terms of coding execution, requiring a lot of extra work while adding a lot of new complexity to the application. This burden is bad enough, but it’s not even the biggest reason that sharding is bad for business.

Complexity does not end once the sharding is done. Actually living with a manually sharded database is downright painful for your developers. Why? Basic operational tasks like backing up, adding indexes, and changing schema have now become more difficult. Repartitioning, rebalancing, skewed usage, cross-shard reporting, and partitioned analytics are other issues you need to deal with.

Ultimately, though, the biggest day-to-day challenge is application development within the confines of a sharded database, where global indexes and cross-shard transactions can be easily lost. Your dev team is afraid to touch the app now due to the very real risk of accidentally breaking things — big, important things. The result is a continual tax on development speed.

The REAL reason sharding is bad for business

This brings us to the real reason manually sharding to scale out a database you can no longer scale up is bad: It is expensive, in every sense of the word.

Development and operational costs. Deploying an application on sharded PostgreSQL is difficult and cumbersome to live with in production.

Business cost. Living with a manually sharded database means that every time you need to make a change to the application — and valuable applications changes — your developers have to reshard, rebalance, and repartition the database. Then, once again, every place your application touches data must be re-edited or rewritten. This is expensive in both time and resources, and incurs a huge — although frequently unrecognized — business cost.

Speed and agility cost. To evolve (or sometimes even simply scale) an application built on a manually sharded PostgreSQL database, developers regularly need to change all that sharding code, and they’re deeply aware of how easy it is to break things when they do so. When your developers are anxious about iterating (“I don’t wanna mess with the app because if I do I’m going to break something!”), your business is held back.

With a manually sharded database, change becomes risky because sharding introduces a lot of database logic into your application code. Sharding means accepting that there’s a built-in limit on your development speed, which introduces additional cost and risk to the business.

Fundamentally, technology and software exist to support the business. If the application software that you’re building can’t move fast enough or handle the scale you need, that introduces significant cost and risk.

When it’s part of the database itself, all those problems just… disappear. It’s like a single database that just happens to spread itself across multiple machines. It’s an actual distributed cloud solution.

Long-term costs of sharding

Manual sharding is never a “one and done” event.

Many people overlook the long-term reality of living with a manually sharded database. They think, “It’ll just be three FTEs for six months to migrate the app.” But those developers are here to stay. Any time you want to change the application, they’re the ones who know all the places it needs to change.

They may not work full time on it, but if they were to leave, application development speed would slow down even more — so now you add the cost of spending significant energy and resources to retain these specific people.

This is why refactoring your application to work with your now-sharded database is not the biggest difficulty or expense. Updating and changing your application (i.e. operational maintenance) is the real problem, and one that only grows with time. Any time you want to change schemas, rebalance, change database properties, move internal partitions between database instances, your developers have to update the application logic to account for it. And they have to do this manually, every time, every single place the app touches data.

Let your database do the sharding for you

Companies that truly understand the overhead and opportunity costs of manual sharding — both startups and large enterprise organizations — have given up on this slow and painstaking old way. They have switched to distributed SQL databases, which offer built-in horizontal scalability:

In a world where things like Dynamo/Cassandra or Spanner/CockroachDB exist, manually-sharded DB solutions are pretty much entirely obsolete. Spanner exists because Google got so sick of people building and maintaining bespoke solutions for replication and resharding, which would inevitably have their own set of quirks, bugs, consistency gaps, scaling limits, and manual operations required to reshard or rebalance from time to time.

When it’s part of the database itself, all those problems just… disappear. It’s like a single database that just happens to spread itself across multiple machines. It’s an actual distributed cloud solution. — HackerNews user GeneralMayhem

If you are hitting the limit with your current RDBMS and trying to figure out how to scale up, it’s time to start thinking along these same lines. Why are you not using a database that does the sharding for you? One that is purpose-built on distributed architecture for consistency, reliability and native horizontal scalability?

When you’re paying for a solution that is supposed to run your database for you, why are you the one who has to figure out how to scale the darn thing?